Web scraping has become popular in scientific research, especially in statistics. Setting up an IT environment for web scraping is no longer difficult and can be done relatively quickly, and extracting data in this way requires only basic IT skills. This has led to the increased use of this type of data, widely referred to as big data, in official statistics. Over the past decade, much work has been done in this area, both at the national level within national statistical institutes and at the international level by Eurostat. The aim of this paper is to present and discuss current problems related to accessing, extracting, and using information from websites, along with potential solutions.
For the purposes of the analysis, the paper presents a case study of large-scale web scraping performed in 2022 using big data tools. The results of the case study, conducted on a total population of approximately 503,700 websites, demonstrate that it is not possible to provide reliable data on the basis of such a large sample, as typically up to 20% of the websites might not be accessible at the time of the survey. Moreover, the exact number of active websites in particular countries cannot be known, because the dynamic nature of the Internet causes websites to change continuously.
Keywords: big data, web data, websites, web scraping
JEL: C55, L86, M21